In this vignette I illustrate the key steps required to solve the exercises. By no means am I trying to provide a detailed report of the analysis of each dataset; rather, I am trying to emphasise different things you could do in the different exercises. Any final real report would need additional analyses, and the results would have to be written in good English, not in telegraphic style or lines of code. I also wrote this vignette very quickly, so double-check that I did not mess anything up…
Note: I will not check the assumptions for the LM, nor describe the dataset (or plot the variables), since we did this before.
We fit the LM:
We fit the LM on the BoxCox transformed response:
InsectSprays$count_bis <- InsectSprays$count + 1
mod_insect_bis <- update(mod_insect, count_bis ~ .)
bc <- car::powerTransform(mod_insect_bis)
InsectSprays$count_bc <- car::bcPower(InsectSprays$count_bis, bc$lambda)
mod_insect_bc <- update(mod_insect_bis, count_bc ~ .)
We fit the GLM:
All assumptions about model structure have been checked before (see exercises for LM).
We still have to test for the lack of serial autocorrelation and for independence:
library(DHARMa)
insect_glm_resid <- simulateResiduals(mod_insect_glm, refit = TRUE, n = 1000)
plot(insect_glm_resid)
## DHARMa::plotResiduals - low number of unique predictor values, consider setting asFactor = T
## DHARMa::plotResiduals - low number of unique predictor values, consider setting asFactor = T
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.094556, p-value = 0.5404
## alternative hypothesis: two-sided
## Warning in plotResiduals(pred = x, residuals = NULL, xlab = "Predicted value (rank transformed)", : DHARMa::plotResiduals - predictor is a factor, rank = T has no effect
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 1.8008, p-value = 0.1975
## alternative hypothesis: true autocorrelation is greater than 0
All tests look good. We also have to test the goodness of fit:
##
## DHARMa nonparametric overdispersion test via comparison to simulation under H0 = fitted model
##
## data: insect_glm_resid
## dispersion = 1.5104, p-value = 0.012
## alternative hypothesis: both
There seems to be slight overdispersion :-(
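As a complement to DHARMa's simulation-based test, a quick classical check is the Pearson dispersion statistic; a minimal sketch, assuming `mod_insect_glm` is the Poisson fit from above:

```r
# Sum of squared Pearson residuals over the residual degrees of freedom;
# values clearly above 1 suggest overdispersion.
sum(residuals(mod_insect_glm, type = "pearson")^2) / df.residual(mod_insect_glm)
```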
Is the overdispersion caused by too many zeros?
##
## 0 1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 17 19 20 21 22 23 24 26
## 2 6 4 8 4 7 3 3 1 3 3 2 4 4 2 2 4 1 2 2 1 1 1 2
##
## DHARMa zero-inflation test via comparison to expected zeros with simulation under H0 = fitted model
##
## data: insect_glm_resid
## ratioObsExp = 1.0521, p-value = 0.293
## alternative hypothesis: more
Nope…
Let’s try to fit alternative models to account for the overdispersion:
mod_insect_glm_quasi <- glm(count ~ spray, data = InsectSprays, family = quasipoisson())
library(MASS)
mod_insect_glm_nb <- glm.nb(count ~ spray, data = InsectSprays)
rbind(deviance(mod_insect_glm),
deviance(mod_insect_glm_quasi),
deviance(mod_insect_glm_nb))
## [,1]
## [1,] 98.32866
## [2,] 98.32866
## [3,] 74.14497
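The three fits differ in their assumed mean–variance relationship, which is what the extra parameters absorb. A minimal sketch with illustrative values (the `phi` and `theta` below are made up for illustration, not the fitted ones):

```r
# Assumed variance of a count with mean mu under each model:
#   Poisson:           Var = mu
#   quasi-Poisson:     Var = phi * mu         (phi estimated from the data)
#   negative binomial: Var = mu + mu^2/theta  (theta estimated by glm.nb)
mu <- 10
phi <- 1.5   # illustrative dispersion parameter
theta <- 5   # illustrative NB size parameter
c(poisson = mu, quasipoisson = phi * mu, negbin = mu + mu^2 / theta)
```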
Note that the predictions are the same:
plot(predict(mod_insect_glm, type = "response"), mod_insect_glm$model$count)
abline(0, 1, col = "red")
plot(predict(mod_insect_glm_quasi, type = "response"), mod_insect_glm_quasi$model$count)
abline(0, 1, col = "red")
plot(predict(mod_insect_glm_nb, type = "response"), mod_insect_glm_nb$model$count)
abline(0, 1, col = "red")
The only thing that differs is the variance around those predictions.
Note also that the Gaussian models have their deviance expressed on another scale, which makes the comparison across all models particularly difficult.
pred_bc <- ((predict(mod_insect_bc) * bc$lambda) + 1)^(1/bc$lambda) - 1
(deviance_LM_bc <- sum((pred_bc - InsectSprays$count)^2))
## [1] 1028.846
## [1] 1015.167
We can compare the Poisson model with the negative binomial one very simply:
## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of class "negbin", updated model is of class "glm"
## Likelihood ratio test
##
## Model 1: count ~ spray
## Model 2: count ~ spray
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -180.11
## 2 6 -182.29 -1 4.3713 0.03655 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let us stick to the negative binomial model as our best GLM!
insect_glm_nb_resid <- simulateResiduals(mod_insect_glm_nb, refit = TRUE, n = 1000)
testUniformity(insect_glm_nb_resid)
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.076, p-value = 0.7999
## alternative hypothesis: two-sided
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 2.1796, p-value = 0.7783
## alternative hypothesis: true autocorrelation is greater than 0
##
## DHARMa nonparametric overdispersion test via comparison to simulation under H0 = fitted model
##
## data: insect_glm_nb_resid
## dispersion = 1.0673, p-value = 0.408
## alternative hypothesis: both
##
## DHARMa zero-inflation test via comparison to expected zeros with simulation under H0 = fitted model
##
## data: insect_glm_nb_resid
## ratioObsExp = 0.89646, p-value = 0.383
## alternative hypothesis: more
Everything looks very good!
## Analysis of Variance Table
##
## Model 1: count ~ spray
## Model 2: count ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 66 1015.2
## 2 71 3684.0 -5 -2668.8 34.702 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: count_bc ~ spray
## Model 2: count_bc ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 66 31.998
## 2 71 150.337 -5 -118.34 48.818 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Deviance Table
##
## Model 1: count ~ spray
## Model 2: count ~ 1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 66 98.33
## 2 71 409.04 -5 -310.71 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Likelihood ratio tests of Negative Binomial Models
##
## Response: count
## Model theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 1 1.736021 71 -467.9604
## 2 spray 28.099507 66 -360.2179 1 vs 2 5 107.7425 0
No matter the model, the effect of the insecticide is detected!
We now compare the predictions and CI between the different models:
data.for.pred <- data.frame(spray = levels(InsectSprays$spray))
pred_lm <- predict(mod_insect, newdata = data.for.pred, interval = "confidence")
pred_bc <- predict(mod_insect_bc, newdata = data.for.pred, interval = "confidence")
unboxcox <- function(x, lambda) (x*lambda + 1)^(1/lambda)
pred_bc_unboxcox <- unboxcox(x = pred_bc, lambda = bc$lambda) - 1 ## we subtract the 1 we added earlier to make the Box-Cox possible
pred_glm_eta <- predict(mod_insect_glm, newdata = data.for.pred, se.fit = TRUE)
pred_glm_table <- data.frame(fit = exp(pred_glm_eta$fit))
pred_glm_table$lwr <- exp(pred_glm_eta$fit + qnorm(0.025)*pred_glm_eta$se.fit)
pred_glm_table$upr <- exp(pred_glm_eta$fit + qnorm(0.975)*pred_glm_eta$se.fit)
pred_glm_nb_eta <- predict(mod_insect_glm_nb, newdata = data.for.pred, se.fit = TRUE)
pred_glm_nb_table <- data.frame(fit = spaMM::negbin()$linkinv(pred_glm_nb_eta$fit))
pred_glm_nb_table$lwr <- spaMM::negbin()$linkinv(pred_glm_nb_eta$fit + qnorm(0.025)*pred_glm_nb_eta$se.fit)
pred_glm_nb_table$upr <- spaMM::negbin()$linkinv(pred_glm_nb_eta$fit + qnorm(0.975)*pred_glm_nb_eta$se.fit)
Observe carefully the results:
plot(InsectSprays$count ~ as.numeric(InsectSprays$spray), xlim = c(0.5, 6.5), ylim = c(0, 30),
ylab = "Predicted number of insects", xlab = "Sprays", axes = FALSE)
axis(2, las = 2)
axis(1, at = 1:6, labels = levels(InsectSprays$spray))
box()
points(pred_lm[, "fit"] ~ I(-0.2 + 1:6), col = "blue", pch = 20)
arrows(x0 = -0.2 + 1:6, x1 = -0.2 + 1:6, y0 = pred_lm[, "lwr"],
y1 = pred_lm[, "upr"], code = 3, col = "blue", angle = 90, length = 0.1)
points(pred_bc_unboxcox[, "fit"] ~ I(-0.1 + 1:6), col = "green", pch = 20)
arrows(x0 = -0.1 + 1:6, x1 = -0.1 + 1:6, y0 = pred_bc_unboxcox[, "lwr"],
y1 = pred_bc_unboxcox[, "upr"], code = 3, col = "green", angle = 90, length = 0.1)
points(pred_glm_table[, "fit"] ~ I(0.1 + 1:6), col = "red", pch = 20)
arrows(x0 = 0.1 + 1:6, x1 = 0.1 + 1:6, y0 = pred_glm_table[, "lwr"],
y1 = pred_glm_table[, "upr"], code = 3, col = "red", angle = 90, length = 0.1)
points(pred_glm_nb_table[, "fit"] ~ I(0.2 + 1:6), col = "orange", pch = 20)
arrows(x0 = 0.2 + 1:6, x1 = 0.2 + 1:6, y0 = pred_glm_nb_table[, "lwr"],
y1 = pred_glm_nb_table[, "upr"], code = 3, col = "orange", angle = 90, length = 0.1)
legend("top", fill = c("blue", "green", "red", "orange"), legend = c("LM", "LM + BoxCox", "GLM Poisson", "GLM Negative Binomial"), bty = "n")
set.seed(1L)
logLik_diff_H0 <- replicate(1000, {
new.y <- round(((simulate(mod_insect_bc)[, 1] * bc$lambda) + 1)^(1/bc$lambda) - 1)
new.y_plus <- new.y + abs(min(new.y)) - min(new.y + abs(min(new.y)))
new.y_plus1 <- new.y_plus + 1
mod_insect_bis <- lm(new.y_plus1 ~ spray, data = InsectSprays)
bc_new <- car::powerTransform(mod_insect_bis)
new.y_plus1_bc <- car::bcPower(new.y_plus1, bc_new$lambda)
mod_insect_bc_new <- lm(new.y_plus1_bc ~ spray, data = InsectSprays)
mod_insect_glm_nb_new <- glm.nb(new.y_plus ~ spray, data = InsectSprays)
logLik(mod_insect_bc_new)[[1]] - logLik(mod_insect_glm_nb_new)[[1]]
})
(logLik_diff_obs <- logLik(mod_insect_bc)[[1]] - logLik(mod_insect_glm_nb)[[1]])
## [1] 107.1409
## [1] 0.4075924
set.seed(1L)
logLik_diff_H0 <- replicate(1000, {
new.y <- simulate(mod_insect_glm_nb)[, 1]
new.y_plus <- new.y + abs(min(new.y)) - min(new.y + abs(min(new.y)))
new.y_plus1 <- new.y_plus + 1
mod_insect_bis <- lm(new.y_plus1 ~ spray, data = InsectSprays)
bc_new <- car::powerTransform(mod_insect_bis)
new.y_plus1_bc <- car::bcPower(new.y_plus1, bc_new$lambda)
mod_insect_bc_new <- lm(new.y_plus1_bc ~ spray, data = InsectSprays)
mod_insect_glm_nb_new <- glm.nb(new.y ~ spray, data = InsectSprays)
logLik(mod_insect_bc_new)[[1]] - logLik(mod_insect_glm_nb_new)[[1]]
})
(logLik_diff_obs <- logLik(mod_insect_bc)[[1]] - logLik(mod_insect_glm_nb)[[1]])
## [1] 107.1409
## [1] 0.2237762
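The value printed after the observed difference is presumably a parametric-bootstrap p-value, i.e. the proportion of simulated differences at least as extreme as the observed one. A sketch, assuming the objects from the chunk above:

```r
# One-sided parametric-bootstrap p-value: how often does the null
# simulation produce a log-likelihood difference as large as observed?
mean(logLik_diff_H0 >= logLik_diff_obs)
```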
The two models do not differ; we could choose either!
This is not a general result. On other data, an LM on Box-Cox-transformed data and a negative binomial fit may differ greatly!
Try figuring out this one on your own! It’s all in the title.
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
## 7 25-34 40-79 20-29 0 4
## 8 25-34 40-79 30+ 0 7
## 9 25-34 80-119 0-9g/day 0 2
## 10 25-34 80-119 10-19 0 1
## 11 25-34 80-119 30+ 0 2
## 12 25-34 120+ 0-9g/day 0 1
## 13 25-34 120+ 10-19 1 1
## 14 25-34 120+ 20-29 0 1
## 15 25-34 120+ 30+ 0 2
## 16 35-44 0-39g/day 0-9g/day 0 60
## 17 35-44 0-39g/day 10-19 1 14
## 18 35-44 0-39g/day 20-29 0 7
## 19 35-44 0-39g/day 30+ 0 8
## 20 35-44 40-79 0-9g/day 0 35
## 21 35-44 40-79 10-19 3 23
## 22 35-44 40-79 20-29 1 14
## 23 35-44 40-79 30+ 0 8
## 24 35-44 80-119 0-9g/day 0 11
## 25 35-44 80-119 10-19 0 6
## 26 35-44 80-119 20-29 0 2
## 27 35-44 80-119 30+ 0 1
## 28 35-44 120+ 0-9g/day 2 3
## 29 35-44 120+ 10-19 0 3
## 30 35-44 120+ 20-29 2 4
## 31 45-54 0-39g/day 0-9g/day 1 46
## 32 45-54 0-39g/day 10-19 0 18
## 33 45-54 0-39g/day 20-29 0 10
## 34 45-54 0-39g/day 30+ 0 4
## 35 45-54 40-79 0-9g/day 6 38
## 36 45-54 40-79 10-19 4 21
## 37 45-54 40-79 20-29 5 15
## 38 45-54 40-79 30+ 5 7
## 39 45-54 80-119 0-9g/day 3 16
## 40 45-54 80-119 10-19 6 14
## 41 45-54 80-119 20-29 1 5
## 42 45-54 80-119 30+ 2 4
## 43 45-54 120+ 0-9g/day 4 4
## 44 45-54 120+ 10-19 3 4
## 45 45-54 120+ 20-29 2 3
## 46 45-54 120+ 30+ 4 4
## 47 55-64 0-39g/day 0-9g/day 2 49
## 48 55-64 0-39g/day 10-19 3 22
## 49 55-64 0-39g/day 20-29 3 12
## 50 55-64 0-39g/day 30+ 4 6
## 51 55-64 40-79 0-9g/day 9 40
## 52 55-64 40-79 10-19 6 21
## 53 55-64 40-79 20-29 4 17
## 54 55-64 40-79 30+ 3 6
## 55 55-64 80-119 0-9g/day 9 18
## 56 55-64 80-119 10-19 8 15
## 57 55-64 80-119 20-29 3 6
## 58 55-64 80-119 30+ 4 4
## 59 55-64 120+ 0-9g/day 5 10
## 60 55-64 120+ 10-19 6 7
## 61 55-64 120+ 20-29 2 3
## 62 55-64 120+ 30+ 5 6
## 63 65-74 0-39g/day 0-9g/day 5 48
## 64 65-74 0-39g/day 10-19 4 14
## 65 65-74 0-39g/day 20-29 2 7
## 66 65-74 0-39g/day 30+ 0 2
## 67 65-74 40-79 0-9g/day 17 34
## 68 65-74 40-79 10-19 3 10
## 69 65-74 40-79 20-29 5 9
## 70 65-74 80-119 0-9g/day 6 13
## 71 65-74 80-119 10-19 4 12
## 72 65-74 80-119 20-29 2 3
## 73 65-74 80-119 30+ 1 1
## 74 65-74 120+ 0-9g/day 3 4
## 75 65-74 120+ 10-19 1 2
## 76 65-74 120+ 20-29 1 1
## 77 65-74 120+ 30+ 1 1
## 78 75+ 0-39g/day 0-9g/day 1 18
## 79 75+ 0-39g/day 10-19 2 6
## 80 75+ 0-39g/day 30+ 1 3
## 81 75+ 40-79 0-9g/day 2 5
## 82 75+ 40-79 10-19 1 3
## 83 75+ 40-79 20-29 0 3
## 84 75+ 40-79 30+ 1 1
## 85 75+ 80-119 0-9g/day 1 1
## 86 75+ 80-119 10-19 1 1
## 87 75+ 120+ 0-9g/day 2 2
## 88 75+ 120+ 10-19 1 1
## 'data.frame': 88 obs. of 5 variables:
## $ agegp : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alcgp : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
## $ tobgp : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
## $ ncases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ...
The factors are ordered, which selects different default contrasts, so we set them as unordered:
esoph$agegp_f <- factor(esoph$agegp, ordered = FALSE)
esoph$alcgp_f <- factor(esoph$alcgp, ordered = FALSE)
esoph$tobgp_f <- factor(esoph$tobgp, ordered = FALSE)
## agegp alcgp tobgp ncases ncontrols agegp_f alcgp_f tobgp_f
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 1.00 25-34:15 0-39g/day:23 0-9g/day:24
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 3.00 35-44:15 40-79 :23 10-19 :24
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 6.00 45-54:16 80-119 :21 20-29 :20
## 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean :11.08 55-64:16 120+ :21 30+ :20
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00 65-74:15
## 75+ :11 Max. :17.000 Max. :60.00 75+ :11
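To see what unordering the factors actually changes, compare the default contrasts R assigns to each factor type; a self-contained sketch:

```r
# Ordered factors default to polynomial contrasts (linear, quadratic, ...),
# unordered factors to treatment contrasts (dummies against the first level);
# the fitted coefficients therefore mean different things.
x_ord   <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"),
                  ordered = TRUE)
x_unord <- factor(x_ord, ordered = FALSE)
contrasts(x_ord)    # .L and .Q polynomial columns
contrasts(x_unord)  # dummy coding: mid and high vs the baseline low
```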
esoph$cancer_freq <- esoph$ncases / (esoph$ncases + esoph$ncontrols)
esoph$cancer_N <- esoph$ncases + esoph$ncontrols
summary(esoph$cancer_freq) ## proportion of cancer
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2111 0.2089 0.3679 0.5000
##
## 0-9g/day 10-19 20-29 30+
## 0-39g/day 6 6 5 6
## 40-79 6 6 6 5
## 80-119 6 6 4 5
## 120+ 6 6 5 4
## , , = 25-34
##
##
## 0-9g/day 10-19 20-29 30+
## 0-39g/day 1 1 1 1
## 40-79 1 1 1 1
## 80-119 1 1 0 1
## 120+ 1 1 1 1
##
## , , = 35-44
##
##
## 0-9g/day 10-19 20-29 30+
## 0-39g/day 1 1 1 1
## 40-79 1 1 1 1
## 80-119 1 1 1 1
## 120+ 1 1 1 0
##
## , , = 45-54
##
##
## 0-9g/day 10-19 20-29 30+
## 0-39g/day 1 1 1 1
## 40-79 1 1 1 1
## 80-119 1 1 1 1
## 120+ 1 1 1 1
##
## , , = 55-64
##
##
## 0-9g/day 10-19 20-29 30+
## 0-39g/day 1 1 1 1
## 40-79 1 1 1 1
## 80-119 1 1 1 1
## 120+ 1 1 1 1
##
## , , = 65-74
##
##
## 0-9g/day 10-19 20-29 30+
## 0-39g/day 1 1 1 1
## 40-79 1 1 1 0
## 80-119 1 1 1 1
## 120+ 1 1 1 1
##
## , , = 75+
##
##
## 0-9g/day 10-19 20-29 30+
## 0-39g/day 1 1 0 1
## 40-79 1 1 1 1
## 80-119 1 1 0 0
## 120+ 1 1 0 0
We plot the data:
mod_esoph_logit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph,
family = binomial(link = "logit"))
mod_esoph_cauchit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph,
family = binomial(link = "cauchit"))
mod_esoph_probit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph,
family = binomial(link = "probit"))
AIC(mod_esoph_logit)
## [1] 225.454
## [1] 242.4252
## [1] 221.825
The probit model fits the data best.
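The three links only differ in their inverse (mean) function, which in base R are `plogis()`, `pnorm()` and `pcauchy()`; the cauchit's heavy tails let fitted probabilities approach 0 and 1 more slowly. A self-contained sketch:

```r
# Compare the inverse link functions of the three binomial links:
eta <- seq(-4, 4, length.out = 101)
plot(eta, plogis(eta), type = "l", ylab = "p", xlab = "linear predictor")  # logit
lines(eta, pnorm(eta), col = "red")     # probit: steepest, lightest tails
lines(eta, pcauchy(eta), col = "blue")  # cauchit: heaviest tails
legend("topleft", lty = 1, col = c("black", "red", "blue"),
       legend = c("logit", "probit", "cauchit"), bty = "n")
```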
mod_esoph_probit_H0 <- glm(cbind(ncases, ncontrols) ~ 1, data = esoph,
family = binomial(link = "probit"))
anova(mod_esoph_probit, mod_esoph_probit_H0, test = "LR")
## Analysis of Deviance Table
##
## Model 1: cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f
## Model 2: cbind(ncases, ncontrols) ~ 1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 76 50.344
## 2 87 227.241 -11 -176.9 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model is much better than a null model!
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.098545, p-value = 0.3599
## alternative hypothesis: two-sided
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 2.0787, p-value = 0.6445
## alternative hypothesis: true autocorrelation is greater than 0
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 2.0433, p-value = 0.5807
## alternative hypothesis: true autocorrelation is greater than 0
##
## DHARMa zero-inflation test via comparison to expected zeros with simulation under H0 = fitted model
##
## data: esoph_probit_resid
## ratioObsExp = 0.96783, p-value = 0.576
## alternative hypothesis: more
##
## DHARMa nonparametric overdispersion test via comparison to simulation under H0 = fitted model
##
## data: esoph_probit_resid
## dispersion = 0.74473, p-value = 0.17
## alternative hypothesis: both
## Analysis of Deviance Table (Type II tests)
##
## Response: cbind(ncases, ncontrols)
## LR Chisq Df Pr(>Chisq)
## agegp_f 80.109 5 7.961e-16 ***
## alcgp_f 69.533 3 5.373e-15 ***
## tobgp_f 11.593 3 0.008915 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: the two packages differ in how they compute predictions. The package effects seems to compute the average effect over the non-focal factors, weighting their levels by sample size. The package visreg seems to select, for each non-focal factor, the level with the largest number of observations. I am not sure what it does when sample sizes are tied… Doing the predictions yourself is perhaps the easiest way to know what is actually happening!
Let us compare extreme situations:
newdata_esoph_young <- expand.grid(agegp_f = c("25-34"),
alcgp_f = c("0-39g/day", "120+"),
tobgp_f = c("0-9g/day", "30+"))
pred_esoph_young <- predict(mod_esoph_probit, newdata = newdata_esoph_young, type = "link", se.fit = TRUE)
newdata_esoph_young$pred <- binomial(link = "probit")$linkinv(pred_esoph_young$fit)
newdata_esoph_young$lwr <- binomial(link = "probit")$linkinv(pred_esoph_young$fit + qnorm(0.025)*pred_esoph_young$se.fit)
newdata_esoph_young$upr <- binomial(link = "probit")$linkinv(pred_esoph_young$fit + qnorm(0.975)*pred_esoph_young$se.fit)
newdata_esoph_young
## agegp_f alcgp_f tobgp_f pred lwr upr
## 1 25-34 0-39g/day 0-9g/day 0.0007619471 0.0000273923 0.01055611
## 2 25-34 120+ 0-9g/day 0.0262383385 0.0024738469 0.14276353
## 3 25-34 0-39g/day 30+ 0.0039002480 0.0002025362 0.03717823
## 4 25-34 120+ 30+ 0.0764025052 0.0106660728 0.28864986
newdata_esoph_old <- expand.grid(agegp_f = c("75+"),
alcgp_f = c("0-39g/day", "120+"),
tobgp_f = c("0-9g/day", "30+"))
pred_esoph_old <- predict(mod_esoph_probit, newdata = newdata_esoph_old, type = "link", se.fit = TRUE)
newdata_esoph_old$pred <- binomial(link = "probit")$linkinv(pred_esoph_old$fit)
newdata_esoph_old$lwr <- binomial(link = "probit")$linkinv(pred_esoph_old$fit + qnorm(0.025)*pred_esoph_old$se.fit)
newdata_esoph_old$upr <- binomial(link = "probit")$linkinv(pred_esoph_old$fit + qnorm(0.975)*pred_esoph_old$se.fit)
newdata_esoph_old
## agegp_f alcgp_f tobgp_f pred lwr upr
## 1 75+ 0-39g/day 0-9g/day 0.09511196 0.0427947 0.1838748
## 2 75+ 120+ 0-9g/day 0.46850630 0.2953570 0.6479350
## 3 75+ 0-39g/day 30+ 0.21173725 0.1003285 0.3740544
## 4 75+ 120+ 30+ 0.66657817 0.4708055 0.8249030
So alcohol seems to be a bit worse than smoking, but we are extrapolating a bit since very few individuals correspond to the predicted conditions! (see the table() call above)
One thing is clear: not smoking and not drinking seems to be the way to avoid these cancers.
esoph$drink_smoke <- factor(paste(esoph$alcgp_f, esoph$tobgp_f, sep = "_"))
mod_esoph_probit2 <- glm(cbind(ncases, ncontrols) ~ agegp_f + drink_smoke, data = esoph,
family = binomial(link = "probit"))
Anova(mod_esoph_probit2)
## Analysis of Deviance Table (Type II tests)
##
## Response: cbind(ncases, ncontrols)
## LR Chisq Df Pr(>Chisq)
## agegp_f 79.061 5 1.319e-15 ***
## drink_smoke 94.224 15 1.606e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## glm(formula = cbind(ncases, ncontrols) ~ agegp_f + drink_smoke,
## family = binomial(link = "probit"), data = esoph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8509 -0.5300 -0.2056 0.2616 1.9162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.3734 0.4531 -7.446 9.62e-14 ***
## agegp_f35-44 0.7288 0.4542 1.604 0.108615
## agegp_f45-54 1.4536 0.4333 3.355 0.000794 ***
## agegp_f55-64 1.6772 0.4305 3.896 9.78e-05 ***
## agegp_f65-74 1.9048 0.4355 4.373 1.22e-05 ***
## agegp_f75+ 1.8491 0.4675 3.955 7.64e-05 ***
## drink_smoke0-39g/day_10-19 0.5839 0.2418 2.415 0.015737 *
## drink_smoke0-39g/day_20-29 0.5995 0.3065 1.956 0.050454 .
## drink_smoke0-39g/day_30+ 0.9204 0.3302 2.787 0.005320 **
## drink_smoke120+_0-9g/day 1.5668 0.2609 6.006 1.90e-09 ***
## drink_smoke120+_10-19 1.6410 0.2891 5.676 1.38e-08 ***
## drink_smoke120+_20-29 1.7381 0.3495 4.973 6.58e-07 ***
## drink_smoke120+_30+ 1.6849 0.3173 5.311 1.09e-07 ***
## drink_smoke40-79_0-9g/day 0.8609 0.1939 4.441 8.95e-06 ***
## drink_smoke40-79_10-19 0.9881 0.2234 4.423 9.71e-06 ***
## drink_smoke40-79_20-29 1.0136 0.2341 4.329 1.50e-05 ***
## drink_smoke40-79_30+ 1.3801 0.2923 4.721 2.34e-06 ***
## drink_smoke80-119_0-9g/day 1.0703 0.2257 4.743 2.11e-06 ***
## drink_smoke80-119_10-19 1.1832 0.2303 5.137 2.80e-07 ***
## drink_smoke80-119_20-29 1.1478 0.3330 3.447 0.000566 ***
## drink_smoke80-119_30+ 1.5511 0.3470 4.470 7.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 227.241 on 87 degrees of freedom
## Residual deviance: 44.889 on 67 degrees of freedom
## AIC: 234.37
##
## Number of Fisher Scoring iterations: 6
Ideally we should again check all assumptions (you should do it!).
Let us do the minimum here:
Let us now compare the alcohol and smoking effects.
##
## 0-39g/day_0-9g/day 0-39g/day_10-19 0-39g/day_20-29 0-39g/day_30+ 120+_0-9g/day 120+_10-19 120+_20-29 120+_30+ 40-79_0-9g/day 40-79_10-19
## 6 6 5 6 6 6 5 4 6 6
## 40-79_20-29 40-79_30+ 80-119_0-9g/day 80-119_10-19 80-119_20-29 80-119_30+
## 6 5 6 6 4 5
contr_base <- matrix(c(1, rep(0, 15)), nrow = 1) ## define the design matrix for the posthoc test
contr <- contr_base
contr[levels(esoph$drink_smoke) == "120+_0-9g/day"] <- 1
contr[levels(esoph$drink_smoke) == "0-39g/day_30+"] <- -1
colnames(contr) <- levels(esoph$drink_smoke)
rownames(contr) <- "alcool - smoke"
contr
## 0-39g/day_0-9g/day 0-39g/day_10-19 0-39g/day_20-29 0-39g/day_30+ 120+_0-9g/day 120+_10-19 120+_20-29 120+_30+ 40-79_0-9g/day 40-79_10-19 40-79_20-29 40-79_30+ 80-119_0-9g/day
## alcool - smoke 1 0 0 -1 1 0 0 0 0 0 0 0 0
## 80-119_10-19 80-119_20-29 80-119_30+
## alcool - smoke 0 0 0
library(multcomp)
posthoc <- glht(mod_esoph_probit2, linfct = mcp(drink_smoke = contr))
par(oma = c(1, 5, 1, 1))
plot(posthoc)
##
## Simultaneous Confidence Intervals
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Fit: glm(formula = cbind(ncases, ncontrols) ~ agegp_f + drink_smoke,
## family = binomial(link = "probit"), data = esoph)
##
## Quantile = 1.96
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## alcool - smoke == 0 0.64643 -0.04785 1.34071
##
## General Linear Hypotheses
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Linear Hypotheses:
## Estimate
## alcool - smoke == 0 0.6464
##
## Global Test:
## Chisq DF Pr(>Chisq)
## 1 3.33 1 0.06802
We have compared two different situations, using a posthoc approach.
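Since the contrast involves only two non-reference levels, the glht estimate should simply be the difference of the corresponding model coefficients; a sketch, assuming `mod_esoph_probit2` as fitted above:

```r
# The posthoc estimate reduces to a difference of two coefficients:
cf <- coef(mod_esoph_probit2)
cf["drink_smoke120+_0-9g/day"] - cf["drink_smoke0-39g/day_30+"]
# this should match the glht estimate of about 0.646
```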
esoph$alc_num <- unclass(esoph$alcgp)
esoph$tob_num <- unclass(esoph$tobgp)
mod_esoph_probit3 <- glm(cbind(ncases, ncontrols) ~ agegp_f + alc_num + tob_num, data = esoph,
family = binomial(link = "probit"))
plot(simulateResiduals(mod_esoph_probit3, refit = TRUE, n = 1000))
## [1] 221.825
## [1] 234.3696
## [1] 218.8386
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -4.66120665 -2.8095856
## agegp_f35-44 -0.03856800 1.8280202
## agegp_f45-54 0.71465265 2.4983633
## agegp_f55-64 0.94248804 2.7131850
## agegp_f65-74 1.14747222 2.9354270
## agegp_f75+ 1.02421193 2.9346507
## alc_num 0.29182068 0.4810442
## tob_num 0.06227855 0.2460660
This last model seems quite good and it allows us to answer our question!
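To make the `alc_num` slope concrete, one can back-transform predictions at the extreme alcohol scores; a sketch, assuming `mod_esoph_probit3` from above (the age group picked here is arbitrary):

```r
# Predicted cancer probability for a 55-64 year old at the lowest vs
# highest alcohol category, with tobacco fixed at the lowest category:
nd <- data.frame(agegp_f = "55-64", alc_num = c(1, 4), tob_num = 1)
predict(mod_esoph_probit3, newdata = nd, type = "response")
```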
## survived sex age passengerClass
## Allen, Miss. Elisabeth Walton yes female 29.0000 1st
## Allison, Master. Hudson Trevor yes male 0.9167 1st
## Allison, Miss. Helen Loraine no female 2.0000 1st
## Allison, Mr. Hudson Joshua Crei no male 30.0000 1st
## Allison, Mrs. Hudson J C (Bessi no female 25.0000 1st
## Anderson, Mr. Harry yes male 48.0000 1st
## 'data.frame': 1309 obs. of 4 variables:
## $ survived : Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 2 1 2 1 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : num 29 0.917 2 30 25 ...
## $ passengerClass: Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
## survived sex age passengerClass
## no :809 female:466 Min. : 0.1667 1st:323
## yes:500 male :843 1st Qu.:21.0000 2nd:277
## Median :28.0000 3rd:709
## Mean :29.8811
## 3rd Qu.:39.0000
## Max. :80.0000
## NA's :263
## female male
## 0.7274678 0.1909846
## 1st 2nd 3rd
## 0.6191950 0.4296029 0.2552891
## passengerClass
## sex 1st 2nd 3rd
## female 144 106 216
## male 179 171 493
The effect of age does not seem linear…
TitanicSurvival$surv <- TitanicSurvival$survived == "yes"
TitanicSurvival$age_f <- cut(TitanicSurvival$age, breaks = c(0, 10, 25, 35, 50, 81), include.lowest = TRUE)
table(TitanicSurvival$age_f)
##
## [0,10] (10,25] (25,35] (35,50] (50,81]
## 86 357 281 227 95
We fit a series of models with different links, considering age either as a factor or as continuous:
mod_logit <- glm(surv ~ age_f+passengerClass+sex, data = TitanicSurvival, family = binomial(link = "logit"))
mod_probit <- glm(surv ~ age_f+passengerClass+sex, data = TitanicSurvival, family = binomial(link = "probit"))
mod_cauchit <- glm(surv ~ age_f+passengerClass+sex, data = TitanicSurvival, family = binomial(link = "cauchit"))
mod_logit2 <- glm(surv ~ age+passengerClass+sex, data = TitanicSurvival,
family = binomial(link = "logit"))
mod_probit2 <- glm(surv ~ age+passengerClass+sex, data = TitanicSurvival,
family = binomial(link = "probit"))
mod_cauchit2 <- glm(surv ~ age+passengerClass+sex, data = TitanicSurvival,
family = binomial(link = "cauchit"))
rbind(AIC(mod_logit),
AIC(mod_probit),
AIC(mod_cauchit),
AIC(mod_logit2),
AIC(mod_probit2),
AIC(mod_cauchit2))
## [,1]
## [1,] 995.0867
## [2,] 997.4459
## [3,] 978.3824
## [4,] 992.4531
## [5,] 994.5528
## [6,] 981.3796
The cauchit model with age as a factor provides a much better fit than other models.
library(DHARMa)
resid_Titanic <- simulateResiduals(mod_cauchit, refit = TRUE, n = 1000)
plot(resid_Titanic)
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 1.9828, p-value = 0.3902
## alternative hypothesis: true autocorrelation is greater than 0
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 1.9914, p-value = 0.4446
## alternative hypothesis: true autocorrelation is greater than 0
No need to test for overdispersion in a binary model: with a 0/1 response, the variance is entirely determined by the mean (p(1 - p)), so there is no free dispersion to detect!
data.for.pred <- with(data = TitanicSurvival,
expand.grid(age_f = levels(age_f), passengerClass = levels(passengerClass), sex = levels(sex)))
pred <- predict(mod_cauchit, newdata = data.for.pred, type = "link")
data.to.plot <- cbind(data.for.pred, pred = pred)
coplot(pred ~ unclass(age_f) | passengerClass + sex, data = data.to.plot)
library(lattice) ## same, fancier
xyplot(pred ~ unclass(age_f) | passengerClass + sex, data = data.to.plot)
Another way to cope with the non-linearity of the age effect would be to fit a GAM:
library(mgcv)
mod_cauchit_gam <- gam(surv ~ s(age) + passengerClass + sex, data = TitanicSurvival, family = binomial(link = "cauchit"))
AIC(mod_cauchit_gam)
## [1] 972.3109
The model is a bit better; we could use it.
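To inspect the fitted smooth directly, mgcv can plot it with a confidence band; a sketch, assuming `mod_cauchit_gam` from above:

```r
# Estimated smooth of age on the link (cauchit) scale; the high survival
# of young children should show up as a clear non-linearity.
plot(mod_cauchit_gam, shade = TRUE, xlab = "Age")
```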
## Analysis of Deviance Table (Type II tests)
##
## Response: surv
## LR Chisq Df Pr(>Chisq)
## age_f 37.089 4 1.727e-07 ***
## passengerClass 124.906 2 < 2.2e-16 ***
## sex 292.157 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Family: binomial
## Link function: cauchit
##
## Formula:
## surv ~ s(age) + passengerClass + sex
##
## Parametric Terms:
## df Chi.sq p-value
## passengerClass 2 67.04 2.77e-15
## sex 1 87.49 < 2e-16
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 7.488 8.327 38.67 7.33e-06
data.for.pred2 <- with(data = TitanicSurvival,
expand.grid(age_f = levels(TitanicSurvival$age_f), passengerClass = levels(passengerClass), sex = levels(sex)))
data.for.pred2$age <- NA
data.for.pred2$age[data.for.pred2$age_f == "[0,10]"] <- (0 + 10)/2
data.for.pred2$age[data.for.pred2$age_f == "(10,25]"] <- (25 + 10)/2
data.for.pred2$age[data.for.pred2$age_f == "(25,35]"] <- (25 + 35)/2
data.for.pred2$age[data.for.pred2$age_f == "(35,50]"] <- (35 + 50)/2
data.for.pred2$age[data.for.pred2$age_f == "(50,81]"] <- (50 + 81)/2
pred_age_f <- predict(mod_cauchit, newdata = data.for.pred2, type = "response")
pred_age_gam <- predict(mod_cauchit_gam, newdata = data.for.pred2, type = "response")
data.to.plot2 <- cbind(data.for.pred2, pred_age_f = pred_age_f, pred_age_gam = pred_age_gam)
pred_f_1 <- subset(data.to.plot2, sex == "female" & passengerClass == "1st")
pred_f_2 <- subset(data.to.plot2, sex == "female" & passengerClass == "2nd")
pred_f_3 <- subset(data.to.plot2, sex == "female" & passengerClass == "3rd")
pred_m_1 <- subset(data.to.plot2, sex == "male" & passengerClass == "1st")
pred_m_2 <- subset(data.to.plot2, sex == "male" & passengerClass == "2nd")
pred_m_3 <- subset(data.to.plot2, sex == "male" & passengerClass == "3rd")
par(mfrow = c(1, 2))
plot(pred_f_1$pred_age_f ~ pred_f_1$age, type = "o", lwd = 2, ylim = c(0, 1.4),
ylab = "Prob. Survival", xlab = "Age", main = "Age as factor")
points(pred_f_2$pred_age_f ~ pred_f_2$age, type = "o", lwd = 2, col = 2)
points(pred_f_3$pred_age_f ~ pred_f_3$age, type = "o", lwd = 2, col = 3)
points(pred_m_1$pred_age_f ~ pred_m_1$age, type = "o", lwd = 2, col = 1, lty = 2)
points(pred_m_2$pred_age_f ~ pred_m_2$age, type = "o", lwd = 2, col = 2, lty = 2)
points(pred_m_3$pred_age_f ~ pred_m_3$age, type = "o", lwd = 2, col = 3, lty = 2)
legend("top", lty = c(1, 1, 1, 2, 2, 2), col = c(1:3, 1:3),
legend = c("f_1", "f_2", "f_3", "m_1", "m_2", "m_3"),
lwd = rep(2, 6), cex = 0.6)
plot(pred_f_1$pred_age_gam ~ pred_f_1$age, type = "o", ylim = c(0, 1.4),
ylab = "Prob. Survival", xlab = "Age", main = "Age as GAM")
points(pred_f_2$pred_age_gam ~ pred_f_2$age, type = "o", col = 2)
points(pred_f_3$pred_age_gam ~ pred_f_3$age, type = "o", col = 3)
points(pred_m_1$pred_age_gam ~ pred_m_1$age, type = "o", col = 1, lty = 2)
points(pred_m_2$pred_age_gam ~ pred_m_2$age, type = "o", col = 2, lty = 2)
points(pred_m_3$pred_age_gam ~ pred_m_3$age, type = "o", col = 3, lty = 2)
legend("top", lty = c(1, 1, 1, 2, 2, 2), col = c(1:3, 1:3),
legend = c("f_1", "f_2", "f_3", "m_1", "m_2", "m_3"),
lwd = rep(1, 6), cex = 0.6)
As the model is not fitted with a logit link, the odds ratios are not constant with age…
p1 <- pred_f_1[pred_f_1$age == 5, "pred_age_f"]
p2 <- pred_m_3[pred_m_3$age == 5, "pred_age_f"]
(p1/(1 - p1)) / (p2/(1 - p2))
## [1] 41.30262
p1 <- pred_f_1[pred_f_1$age == 65.5, "pred_age_f"]
p2 <- pred_m_3[pred_m_3$age == 65.5, "pred_age_f"]
(p1/(1 - p1)) / (p2/(1 - p2))
## [1] 80.77962
p1 <- pred_f_1[pred_f_1$age == 5, "pred_age_gam"]
p2 <- pred_m_3[pred_m_3$age == 5, "pred_age_gam"]
(p1/(1 - p1)) / (p2/(1 - p2))
## [1] 71.77181
p1 <- pred_f_1[pred_f_1$age == 65.5, "pred_age_gam"]
p2 <- pred_m_3[pred_m_3$age == 65.5, "pred_age_gam"]
(p1/(1 - p1)) / (p2/(1 - p2))
## [1] 47.88521
Note: the differences between the two models occur at the extremes of age, for which we have few observations.
## sex
## passengerClass female male
## 1st 1 3
## 2nd 11 11
## 3rd 29 31
## sex
## passengerClass female male
## 1st 27 37
## 2nd 4 16
## 3rd 1 10
## 'data.frame': 23 obs. of 5 variables:
## $ oring_tot: int 6 6 6 6 6 6 6 6 6 6 ...
## $ oring_dt : int 0 1 0 0 0 0 0 0 1 1 ...
## $ temp : int 66 70 69 68 67 72 73 70 57 63 ...
## $ psi : int 50 50 50 50 50 50 100 100 200 200 ...
## $ flight : int 1 2 3 4 5 6 7 8 9 10 ...
## oring_tot oring_dt temp psi flight
## Min. :6 Min. :0.0000 Min. :53.00 Min. : 50.0 Min. : 1.0
## 1st Qu.:6 1st Qu.:0.0000 1st Qu.:67.00 1st Qu.: 75.0 1st Qu.: 6.5
## Median :6 Median :0.0000 Median :70.00 Median :200.0 Median :12.0
## Mean :6 Mean :0.3043 Mean :69.57 Mean :152.2 Mean :12.0
## 3rd Qu.:6 3rd Qu.:0.5000 3rd Qu.:75.00 3rd Qu.:200.0 3rd Qu.:17.5
## Max. :6 Max. :2.0000 Max. :81.00 Max. :200.0 Max. :23.0
## nb_pb_orings
## nb_orings 0 1 2
## 6 17 5 1
Notice the use of jitter!
mod_binom_logit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger,
family = binomial(link = "logit"))
mod_binom_probit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger,
family = binomial(link = "probit"))
mod_binom_cauchit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger,
family = binomial(link = "cauchit"))
AIC(mod_binom_logit)
## [1] 24.8652
## [1] 24.76826
## [1] 26.63011
We could choose either the logit or the probit model. The estimates would be easier to interpret with the logit, but we do not need that here, so let us consider the probit model since it has the lowest AIC.
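To see where these links differ, one can compare their inverse link functions on the same linear predictor; the cauchit approaches 0 and 1 much more slowly. This is a generic base-R illustration, not tied to the Challenger data:

```r
eta <- c(-4, -2, 0, 2, 4)  ## values of the linear predictor
round(cbind(eta,
            logit   = binomial(link = "logit")$linkinv(eta),
            probit  = binomial(link = "probit")$linkinv(eta),
            cauchit = binomial(link = "cauchit")$linkinv(eta)), 3)
```

All three inverse links give 0.5 at eta = 0; the differences appear in the tails.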
r_probit2 <- simulateResiduals(mod_binom_probit, n = 1000, seed = 2)
plot(r_probit2) ## replicate with other seed
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.26857, p-value = 0.05908
## alternative hypothesis: two-sided
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 0.92502, p-value = 0.002246
## alternative hypothesis: true autocorrelation is greater than 0
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 2.2108, p-value = 0.6963
## alternative hypothesis: true autocorrelation is greater than 0
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 2.2108, p-value = 0.6963
## alternative hypothesis: true autocorrelation is greater than 0
##
## DHARMa zero-inflation test via comparison to expected zeros with simulation under H0 = fitted model
##
## data: r_probit
## ratioObsExp = 0, p-value = 0.001
## alternative hypothesis: more
##
## FALSE
## 23
Let us try to check the zero-inflation differently:
test <- replicate(1000, sum(simulate(mod_binom_probit, 1)[[1]][, 2] < 1))
hist(test)
abline(v = sum(Challenger$oring_dt < 1), col = "red", lwd = 2)
## Analysis of Deviance Table (Type II tests)
##
## Response: cbind(oring_fine, oring_dt)
## LR Chisq Df Pr(>Chisq)
## temp 11.276 1 0.0007853 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## glm(formula = cbind(oring_fine, oring_dt) ~ temp, family = binomial(link = "probit"),
## data = Challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5377 0.1555 0.2984 0.5585 0.7894
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.21720 1.85200 -2.277 0.02278 *
## temp 0.08873 0.02888 3.072 0.00212 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20.7057 on 22 degrees of freedom
## Residual deviance: 9.4301 on 21 degrees of freedom
## AIC: 24.768
##
## Number of Fisher Scoring iterations: 6
pred <- data.frame(temp = 31:81)
pred_probit <- predict(mod_binom_probit, newdata = pred, se.fit = TRUE)
pred$pred <- binomial(link = "probit")$linkinv(pred_probit$fit)
pred$lwr <- binomial(link = "probit")$linkinv(
pred_probit$fit + qnorm(0.025) * pred_probit$se.fit)
pred$upr <- binomial(link = "probit")$linkinv(
pred_probit$fit + qnorm(0.975) * pred_probit$se.fit)
plot(1 - pred$pred ~ pred$temp, type = "l",
ylim = c(0, 1), ylab = "Probability of failure",
xlab = "Temp (F)")
points(1 - pred$lwr ~ pred$temp, type = "l", lty = 2)
points(1 - pred$upr ~ pred$temp, type = "l", lty = 2)
points(oring_dt / oring_tot ~ temp, data = Challenger, col = "red")
## Predict logit
pred2 <- data.frame(temp = 31:81)
pred_logit <- predict(mod_binom_logit, newdata = pred2, se.fit = TRUE)
pred2$pred <- binomial(link = "logit")$linkinv(pred_logit$fit)
pred2$lwr <- binomial(link = "logit")$linkinv(
pred_logit$fit + qnorm(0.025) * pred_logit$se.fit)
pred2$upr <- binomial(link = "logit")$linkinv(
pred_logit$fit + qnorm(0.975) * pred_logit$se.fit)
points(1 - pred2$pred ~ pred2$temp, type = "l", col = 3)
points(1 - pred2$lwr ~ pred2$temp, type = "l", lty = 2, col = 3)
points(1 - pred2$upr ~ pred2$temp, type = "l", lty = 2, col = 3)
Since a single failure could be fatal, perhaps a binary model would make more sense!
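A minimal sketch of such a binary model, assuming the Challenger data frame used above (any_pb is our own variable name):

```r
## 1 if at least one O-ring showed thermal distress on that flight, 0 otherwise
Challenger$any_pb <- as.numeric(Challenger$oring_dt > 0)
mod_binary <- glm(any_pb ~ temp, family = binomial(link = "logit"),
                  data = Challenger)
## predicted probability of at least one O-ring problem at the launch temperature
predict(mod_binary, newdata = data.frame(temp = 31), type = "response")
```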
You could try to replicate the original study that excluded all the OK flights!
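Such a replication could be sketched as follows (assuming the Challenger data frame; this reproduces the flawed pre-launch reasoning, not a recommended analysis):

```r
## keep only the flights with at least one O-ring incident,
## as in the original pre-launch analysis
Challenger_pb <- Challenger[Challenger$oring_dt > 0, ]
mod_pb <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger_pb,
              family = binomial(link = "probit"))
summary(mod_pb)
```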
backward.
## 'data.frame': 14875 obs. of 19 variables:
## $ sex : Factor w/ 2 levels "Boy","Girl": 2 1 1 2 2 1 2 2 2 1 ...
## $ age : num 3660 3841 3675 3694 3742 ...
## $ weight : num 29.4 31.6 34.9 30.2 NA 24.6 40.8 29.9 25.6 29.1 ...
## $ height : num 134 141 143 140 NA ...
## $ glasses : Factor w/ 2 levels "No","Yes": 1 1 NA 1 NA 1 1 1 1 1 ...
## $ cigarettes_mum : num 0 20 20 0 0 15 20 0 30 0 ...
## $ cigarettes_dad : num 0 0 0 0 0 0 30 0 0 0 ...
## $ cigarettes : num 0 20 20 0 0 15 50 0 30 0 ...
## $ cigarettes_kid : Factor w/ 8 levels "Never","Tried once",..: 1 1 1 1 1 2 1 1 1 1 ...
## $ cigarettes_friends: Factor w/ 3 levels "None of them",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ bronchitis : num 0 0 0 0 0 0 0 0 0 0 ...
## $ drink : Factor w/ 4 levels "2 to 3 times a week",..: 4 3 4 4 NA 3 3 4 1 4 ...
## $ milk : num 0 1 3 0 0 3 2 3 0 1 ...
## $ coca : num 0 1 1 0 0 0 1 0 2 1 ...
## $ backward : num 20 16 4 12 NA 14 8 20 6 7 ...
## $ mother_height : num 157 157 155 160 NA 155 155 165 168 157 ...
## $ father_height : num 168 NA 175 NA NA 173 173 178 168 175 ...
## $ mother_weight : num 51 51 50 63 NA 50 57 90 61 53 ...
## $ father_weight : num 79 NA 72 NA NA 76 69 79 65 73 ...
## $sex
## Boy Girl
## 7713 7162
##
## $age
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3557 3666 3685 3708 3721 4237 3253
##
## $weight
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 23.00 28.60 31.80 32.48 35.40 51.00 2433
##
## $height
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 117.0 134.2 138.4 138.6 142.9 163.2 2147
##
## $glasses
## No Yes NA's
## 11997 1454 1424
##
## $cigarettes_mum
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 5.848 10.000 80.000
##
## $cigarettes_dad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 7.337 15.000 99.000
##
## $cigarettes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 4.00 13.18 20.00 120.00
##
## $cigarettes_kid
## Never Tried once Tried twice Lt 1 per week About 1 per week 2-5 per week About 1 per day More than 1 per day NA's
## 10808 1286 324 68 45 37 8 15 2284
##
## $cigarettes_friends
## None of them Some of them Most of them NA's
## 10552 1755 261 2307
##
## $bronchitis
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1498 0.0000 1.0000
##
## $drink
## 2 to 3 times a week Most days Not at all Once a week or less NA's
## 539 148 6689 5354 2145
##
## $milk
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 1.000 1.679 2.000 26.000 2580
##
## $coca
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.889 1.000 51.000 2580
##
## $backward
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 11.00 18.00 15.44 20.00 20.00 2318
##
## $mother_height
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 124.0 157.0 160.0 161.3 165.0 188.0 1473
##
## $father_height
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 137.0 170.0 175.0 175.3 180.0 211.0 2068
##
## $mother_weight
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.00 54.00 60.00 61.01 66.00 164.00 1603
##
## $father_weight
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 41.00 69.00 75.00 75.24 82.00 167.00 2334
UK$cigarettes_kid_bin <- UK$cigarettes_kid != "Never"
table(UK$cigarettes_kid_bin, UK$cigarettes_friends)
##
## None of them Some of them Most of them
## FALSE 9484 1128 142
## TRUE 1045 612 113
## Warning in smoother(.x[subs], .y[subs], col = col[i], log.x = logged("x"), : could not fit smooth
## Warning in smoother(.x[subs], .y[subs], col = col[i], log.x = logged("x"), : could not fit smooth
mod_logit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
family = binomial(link = "logit"), data = UK)
mod_probit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
family = binomial(link = "probit"), data = UK)
mod_cauchit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
family = binomial(link = "cauchit"), data = UK)
c(AIC(mod_logit), AIC(mod_probit), AIC(mod_cauchit))
## [1] 8512.079 8513.535 8513.982
## Analysis of Deviance Table
##
## Model 1: cigarettes_kid_bin ~ sex * (cigarettes + age + cigarettes_friends)
## Model 2: cigarettes_kid_bin ~ 1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 11471 8492.1
## 2 11480 9362.7 -9 -870.58 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
resid_mod_logit <- simulateResiduals(mod_logit, refit = TRUE, n = 1000) ## slow
plot(resid_mod_logit)
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.012736, p-value = 0.04824
## alternative hypothesis: two-sided
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 2.0072, p-value = 0.6506
## alternative hypothesis: true autocorrelation is greater than 0
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 1.995, p-value = 0.3935
## alternative hypothesis: true autocorrelation is greater than 0
## Analysis of Deviance Table (Type II tests)
##
## Response: cigarettes_kid_bin
## LR Chisq Df Pr(>Chisq)
## sex 106.38 1 < 2.2e-16 ***
## cigarettes 26.01 1 3.394e-07 ***
## age 11.78 1 0.0005995 ***
## cigarettes_friends 623.94 2 < 2.2e-16 ***
## sex:cigarettes 0.09 1 0.7663928
## sex:age 0.12 1 0.7329848
## sex:cigarettes_friends 0.76 2 0.6837439
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using the nice “cond” feature from visreg:
visreg(mod_logit, "cigarettes", by = "sex", scale = "response",
       cond = list(cigarettes_friends = "None of them", age = mean(mod_logit$model$age)))
visreg(mod_logit, "cigarettes", by = "sex", scale = "response",
       cond = list(cigarettes_friends = "Most of them", age = mean(mod_logit$model$age)))
visreg(mod_logit, "age", by = "sex", scale = "response",
       cond = list(cigarettes = 0, cigarettes_friends = "None of them"))
visreg(mod_logit, "cigarettes_friends", by = "sex", scale = "response",
       cond = list(cigarettes = 0, age = mean(mod_logit$model$age)))
Note: here, for the sake of simplicity, we simply define menopause as the absence of periods at older ages.
We create the response variable:
We fit a first model to help us explore the data actually used for the fit rather than the full dataset (there are many missing values in the dataset).
mod_menop_logit <- glm(menopause ~ age + bmi + smoked, data = HSE98women, family = binomial())
nrow(HSE98women) ## original number of observations
## [1] 10556
## [1] 7218
We compare the variables before and after filtering to see whether any obvious bias is introduced by the observations we discarded:
## menopause age bmi smoked
## Mode :logical Min. : 2.00 Min. :12.39 Mode :logical
## FALSE:4303 1st Qu.:23.00 1st Qu.:21.11 FALSE:3959
## TRUE :3741 Median :39.00 Median :24.38 TRUE :4735
## NA's :2512 Mean :40.52 Mean :24.96 NA's :1862
## 3rd Qu.:58.00 3rd Qu.:28.30
## Max. :97.00 Max. :54.95
## NA's :1172
## menopause age bmi smoked
## Mode :logical Min. :18.00 Min. :13.80 Mode :logical
## FALSE:3967 1st Qu.:33.00 1st Qu.:22.79 FALSE:3214
## TRUE :3251 Median :46.00 Median :25.55 TRUE :4004
## Mean :47.38 Mean :26.47
## 3rd Qu.:61.00 3rd Qu.:29.30
## Max. :94.00 Max. :54.95
The only big discrepancy between the two datasets is that the minimum age is now 18 yrs old, while the original dataset also contains children. This suggests that girls too young to have periods are not being considered, which will greatly help us model menopause.
Let’s check the correlation between our two continuous predictors:
##
## Pearson's product-moment correlation
##
## data: HSE98women$age and HSE98women$bmi
## t = 50.874, df = 9382, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4489847 0.4807042
## sample estimates:
## cor
## 0.4649937
##
## Pearson's product-moment correlation
##
## data: mod_menop_logit$model$age and mod_menop_logit$model$bmi
## t = 13.85, df = 7216, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1383667 0.1833131
## sample estimates:
## cor
## 0.1609233
The correlation is not very high, especially in the dataset that is used for the fit, so multicollinearity should not be an issue.
We visualise the relationship between the predictor and the response:
library(car)
mod_menop_logit$model$smoked <- as.factor(mod_menop_logit$model$smoked) ## patch to solve bug in car 3.0
scatterplot(menopause ~ age + smoked, data = mod_menop_logit$model)
There does not seem to be any obvious interaction between age and smoked, or between BMI and smoked, so we won't consider any. We will, however, consider that the effect of age may depend on the BMI.
mod_menop_logit <- glm(menopause ~ age * bmi + smoked, data = HSE98women, family = binomial())
mod_menop_probit <- glm(menopause ~ age * bmi + smoked, data = HSE98women, family = binomial(link = "probit"))
mod_menop_cauchit <- glm(menopause ~ age * bmi + smoked, data = HSE98women, family = binomial(link = "cauchit"))
Because it is very difficult to explore the shape of the relationship between age, BMI and the probability of menopause, we will also fit generalized additive models (GAM), which allow the effects to vary smoothly along the predictors.
library(mgcv)
mod_menop_logit_gam <- gam(menopause ~ s(age, bmi) + smoked, data = HSE98women, family = binomial())
mod_menop_probit_gam <- gam(menopause ~ s(age, bmi) + smoked, data = HSE98women, family = binomial(link = "probit"))
mod_menop_cauchit_gam <- gam(menopause ~ s(age, bmi) + smoked, data = HSE98women, family = binomial(link = "cauchit"))
cbind(logit = AIC(mod_menop_logit),
probit = AIC(mod_menop_probit),
cauchit = AIC(mod_menop_cauchit),
logit_GAM = AIC(mod_menop_logit_gam),
probit_GAM = AIC(mod_menop_probit_gam),
cauchit_GAM = AIC(mod_menop_cauchit_gam)
)
## logit probit cauchit logit_GAM probit_GAM cauchit_GAM
## [1,] 3985.441 4097.282 3839.312 3720.37 3722.458 3733.11
The logit GAM seems to be the best model, which prevents us from expressing the effects of the coefficients as simply as with a usual GLM. That's too bad, but since the model is much better, we will go for it.
We are going to test the effect of the interaction with a likelihood ratio test:
mod_menop_logit_gam_no_int <- gam(menopause ~ s(age) + s(bmi) + smoked, data = HSE98women, family = binomial())
LR <- -2 * (logLik(mod_menop_logit_gam_no_int)[[1]] - logLik(mod_menop_logit_gam)[[1]])
DF <- (mod_menop_logit_gam$df.null - mod_menop_logit_gam$df.residual) - (mod_menop_logit_gam_no_int$df.null - mod_menop_logit_gam_no_int$df.residual)
(PV <- pchisq(LR, DF, lower.tail = FALSE))
## [1] 0.175853
Note: you can also use lmtest::lrtest() but it rounds the DF :-/
The interaction between age and BMI is not significant (LRT, LR = 4.5, df = 2.69, p-value = 0.18).
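The same manual LRT is repeated below for each term; a small helper function (our own convenience wrapper, using the same quantities as in the code above) avoids copy-paste errors:

```r
## likelihood ratio test between a reduced and a full mgcv::gam fit
lrt_gam <- function(reduced, full) {
  LR <- -2 * (logLik(reduced)[[1]] - logLik(full)[[1]])
  DF <- (full$df.null - full$df.residual) -
    (reduced$df.null - reduced$df.residual)
  c(LR = LR, df = DF, p.value = pchisq(LR, DF, lower.tail = FALSE))
}
## e.g. lrt_gam(mod_menop_logit_gam_no_int, mod_menop_logit_gam)
```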
Let’s now test the effect of BMI with a likelihood ratio test:
mod_menop_logit_gam_no_bmi <- gam(menopause ~ s(age) + smoked, data = HSE98women, family = binomial())
LR <- -2 * (logLik(mod_menop_logit_gam_no_bmi)[[1]] - logLik(mod_menop_logit_gam)[[1]])
DF <- (mod_menop_logit_gam$df.null - mod_menop_logit_gam$df.residual) - (mod_menop_logit_gam_no_bmi$df.null - mod_menop_logit_gam_no_bmi$df.residual)
(PV <- pchisq(LR, DF, lower.tail = FALSE))
## [1] 5.514563e-157
The effect of BMI is strongly significant (LRT, LR = 732.7, df = 4.23, p-value < 0.001).
Note that GAM models have non-integer degrees of freedom. This is normal.
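This can be seen on a self-contained toy example (simulated data, our own names), where the difference in degrees of freedom used in the tests above comes out fractional because each smooth term uses a fractional number of effective degrees of freedom:

```r
library(mgcv)
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- rbinom(200, 1, plogis(sin(2 * pi * d$x)))
m <- gam(y ~ s(x), data = d, family = binomial())
m$df.null - m$df.residual  ## typically non-integer for a penalized smooth
```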
Let’s now test the effect of age with a likelihood ratio test:
mod_menop_logit_gam_no_age <- gam(menopause ~ s(bmi) + smoked, data = HSE98women, family = binomial())
LR <- -2 * (logLik(mod_menop_logit_gam_no_age)[[1]] - logLik(mod_menop_logit_gam)[[1]])
DF <- (mod_menop_logit_gam$df.null - mod_menop_logit_gam$df.residual) - (mod_menop_logit_gam_no_age$df.null - mod_menop_logit_gam_no_age$df.residual)
(PV <- pchisq(LR, DF, lower.tail = FALSE))
## [1] 0
The effect of age is strongly significant (LRT, LR = 5900.1, df = 5.42, p-value < 0.001).
Let’s now test the effect of smoking with a likelihood ratio test:
mod_menop_logit_gam_never_smoked <- gam(menopause ~ s(age, bmi), data = HSE98women, family = binomial())
LR <- -2 * (logLik(mod_menop_logit_gam_never_smoked)[[1]] - logLik(mod_menop_logit_gam)[[1]])
DF <- (mod_menop_logit_gam$df.null - mod_menop_logit_gam$df.residual) - (mod_menop_logit_gam_never_smoked$df.null - mod_menop_logit_gam_never_smoked$df.residual)
(PV <- pchisq(LR, DF, lower.tail = FALSE))
## [1] 0.0001646144
The effect of having smoked is also significant (LRT, LR = 14.5, df = 1.09, p-value < 0.001).
Let’s first explore the model output with the package visreg, which also works on GAM models:
These plots show the patterns captured by the smooth terms of the GAM: the effect of age seems to change after 40 yrs, and the effect of BMI seems to change between BMI values of 23 and 32. It is difficult to capture such threshold effects with polynomials, so we will have to stick to the GAM here…
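For illustration, a polynomial GLM alternative could be sketched as follows (our own model name; we use I() terms rather than poly() because poly() does not accept the missing values present in HSE98women); its AIC could then be compared with that of the GAM:

```r
## cubic in age, quadratic in BMI, as a rough attempt at the threshold shapes
mod_menop_poly <- glm(menopause ~ age + I(age^2) + I(age^3) +
                        bmi + I(bmi^2) + smoked,
                      data = HSE98women, family = binomial())
AIC(mod_menop_poly)  ## to compare with AIC(mod_menop_logit_gam)
```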
Let us do the same plots on the scale of the response variable:
We will now plot the predictions from the logit GLM, the cauchit GLM and the logit GAM to explore their differences. We will predict the influence of age, considering the median BMI and smokers.
age.for.pred <- seq(min(mod_menop_cauchit_gam$model$age), max(mod_menop_cauchit_gam$model$age), length = 100)
data.for.pred <- expand.grid(
age = age.for.pred,
bmi = median(mod_menop_cauchit_gam$model$bmi),
smoked = TRUE
)
gam.p <- predict(mod_menop_logit_gam, newdata = data.for.pred, se.fit = TRUE)
gam.upr <- binomial(link = "logit")$linkinv(gam.p$fit + qnorm(0.975) * gam.p$se.fit)
gam.lwr <- binomial(link = "logit")$linkinv(gam.p$fit + qnorm(0.025) * gam.p$se.fit)
gam.fit <- binomial(link = "logit")$linkinv(gam.p$fit)
cauchit.p <- predict(mod_menop_cauchit, newdata = data.for.pred, se.fit = TRUE)
cauchit.upr <- binomial(link = "cauchit")$linkinv(cauchit.p$fit + qnorm(0.975) * cauchit.p$se.fit)
cauchit.lwr <- binomial(link = "cauchit")$linkinv(cauchit.p$fit + qnorm(0.025) * cauchit.p$se.fit)
cauchit.fit <- binomial(link = "cauchit")$linkinv(cauchit.p$fit)
logit.p <- predict(mod_menop_logit, newdata = data.for.pred, se.fit = TRUE)
logit.upr <- binomial(link = "logit")$linkinv(logit.p$fit + qnorm(0.975) * logit.p$se.fit)
logit.lwr <- binomial(link = "logit")$linkinv(logit.p$fit + qnorm(0.025) * logit.p$se.fit)
logit.fit <- binomial(link = "logit")$linkinv(logit.p$fit)
plot(gam.fit ~ age.for.pred, type = "l", las = 1, ylab = "Probability of being menopaused", xlab = "Age (yrs)", ylim = c(0, 1))
points(gam.upr ~ age.for.pred, type = "l", lty = 2)
points(gam.lwr ~ age.for.pred, type = "l", lty = 2)
points(cauchit.fit ~ age.for.pred, type = "l", lty = 1, col = "green")
points(cauchit.upr ~ age.for.pred, type = "l", lty = 2, col = "green")
points(cauchit.lwr ~ age.for.pred, type = "l", lty = 2, col = "green")
points(logit.fit ~ age.for.pred, type = "l", lty = 1, col = "red")
points(logit.upr ~ age.for.pred, type = "l", lty = 2, col = "red")
points(logit.lwr ~ age.for.pred, type = "l", lty = 2, col = "red")
legend("topleft", fill = c("black", "green", "red"), legend = c("logit GAM", "cauchit", "logit"), bty = "n")
Selecting the best model, let's now look at the effect of BMI, still using smokers:
data.for.pred <- expand.grid(
age = age.for.pred,
bmi = c(18.5, 25, 30),
smoked = TRUE
)
gam.skiny.p <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 18.5), se.fit = TRUE)
gam.skiny.upr <- binomial(link = "logit")$linkinv(gam.skiny.p$fit + qnorm(0.975) * gam.skiny.p$se.fit)
gam.skiny.lwr <- binomial(link = "logit")$linkinv(gam.skiny.p$fit + qnorm(0.025) * gam.skiny.p$se.fit)
gam.skiny.fit <- binomial(link = "logit")$linkinv(gam.skiny.p$fit)
gam.overweight.p <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 25), se.fit = TRUE)
gam.overweight.upr <- binomial(link = "logit")$linkinv(gam.overweight.p$fit + qnorm(0.975) * gam.overweight.p$se.fit)
gam.overweight.lwr <- binomial(link = "logit")$linkinv(gam.overweight.p$fit + qnorm(0.025) * gam.overweight.p$se.fit)
gam.overweight.fit <- binomial(link = "logit")$linkinv(gam.overweight.p$fit)
gam.obese.p <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 30), se.fit = TRUE)
gam.obese.upr <- binomial(link = "logit")$linkinv(gam.obese.p$fit + qnorm(0.975) * gam.obese.p$se.fit)
gam.obese.lwr <- binomial(link = "logit")$linkinv(gam.obese.p$fit + qnorm(0.025) * gam.obese.p$se.fit)
gam.obese.fit <- binomial(link = "logit")$linkinv(gam.obese.p$fit)
plot(gam.skiny.fit ~ age.for.pred, type = "l", las = 1, ylab = "Probability of being menopaused", xlab = "Age (yrs)", ylim = c(0, 1))
points(gam.skiny.upr ~ age.for.pred, type = "l", lty = 2)
points(gam.skiny.lwr ~ age.for.pred, type = "l", lty = 2)
points(gam.overweight.fit ~ age.for.pred, type = "l", lty = 1, col = "green")
points(gam.overweight.upr ~ age.for.pred, type = "l", lty = 2, col = "green")
points(gam.overweight.lwr ~ age.for.pred, type = "l", lty = 2, col = "green")
points(gam.obese.fit ~ age.for.pred, type = "l", lty = 1, col = "red")
points(gam.obese.upr ~ age.for.pred, type = "l", lty = 2, col = "red")
points(gam.obese.lwr ~ age.for.pred, type = "l", lty = 2, col = "red")
legend("topleft", fill = c("black", "green", "red"), legend = c("skiny", "overweight", "obese"), bty = "n")
We can see on this plot that obesity seems to accelerate the onset of menopause, but it does not seem to make any difference later on.
We will now predict the proportion of women in menopause at 40, 45, 50, 55 and 60 yrs. To illustrate the effect of BMI, we will perform predictions for a BMI of 21.75 (middle of the normal BMI range according to the WHO) and 30 (lower limit for obesity) kg/m^2. Again, we will perform the predictions considering smokers since we have data for this category. We will also compute the confidence intervals (prediction intervals make little sense in the context of predicting proportions…).
data.for.pred <- expand.grid(
age = c(40, 45, 50, 55, 60),
bmi = c(21.75, 30),
smoked = TRUE
)
pred_normal <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 21.75), se.fit = TRUE)
tab_normal <- cbind(
predict = plogis(pred_normal$fit),
lwr.CI = plogis(pred_normal$fit + qnorm(0.025)*pred_normal$se.fit),
upr.CI = plogis(pred_normal$fit + qnorm(0.975)*pred_normal$se.fit)
)
rownames(tab_normal) <- c(40, 45, 50, 55, 60)
round(tab_normal, 2)
## predict lwr.CI upr.CI
## 40 0.08 0.06 0.10
## 45 0.18 0.15 0.22
## 50 0.51 0.45 0.56
## 55 0.87 0.83 0.90
## 60 0.97 0.95 0.98
This gives the information for individuals with a normal BMI. Let's now look at the corresponding proportions for obese women:
pred_obses <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 30), se.fit = TRUE)
tab_obses <- cbind(
predict = plogis(pred_obses$fit),
lwr.CI = plogis(pred_obses$fit + qnorm(0.025)*pred_obses$se.fit),
upr.CI = plogis(pred_obses$fit + qnorm(0.975)*pred_obses$se.fit)
)
rownames(tab_obses) <- c(40, 45, 50, 55, 60)
round(tab_obses, 2)
## predict lwr.CI upr.CI
## 40 0.14 0.11 0.17
## 45 0.26 0.22 0.31
## 50 0.57 0.51 0.62
## 55 0.87 0.84 0.90
## 60 0.97 0.95 0.98
These tables confirm our expectation concerning the influence of BMI upon the onset of menopause.
For the sake of comparison, let's compare the first table to what we would have predicted using the cauchit model or the logit GLM:
data.for.pred <- expand.grid(age = c(40, 45, 50, 55, 60), bmi = 21.75, smoked = TRUE)
pred_GAM <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 21.75), se.fit = TRUE)
pred_logit <- predict(mod_menop_logit, newdata = subset(data.for.pred, bmi == 21.75), se.fit = TRUE)
pred_cauchit <- predict(mod_menop_cauchit, newdata = subset(data.for.pred, bmi == 21.75), se.fit = TRUE)
tab <- cbind(
logit_GAM = plogis(pred_GAM$fit),
logit_GLM = plogis(pred_logit$fit),
cauchit_GLM = binomial(link = "cauchit")$linkinv(pred_cauchit$fit)
)
rownames(tab) <- c(40, 45, 50, 55, 60)
round(tab, 2)
## logit_GAM logit_GLM cauchit_GLM
## 40 0.08 0.14 0.07
## 45 0.18 0.32 0.15
## 50 0.51 0.56 0.57
## 55 0.87 0.78 0.88
## 60 0.97 0.91 0.93
Let’s look at the same proportions in the raw data. We cannot be as strict, otherwise we won't get many individuals, so let's enlarge the selection a bit and check that we get at least a few hundred individuals:
tab <- with(HSE98women,
rbind(sum(age >= 39 & age <= 41 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
sum(age >= 44 & age <= 46 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
sum(age >= 49 & age <= 51 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
sum(age >= 54 & age <= 56 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
sum(age >= 59 & age <= 61 & bmi > 18.5 & bmi < 25, na.rm = TRUE)
)
)
rownames(tab) <- c(40, 45, 50, 55, 60)
tab
## [,1]
## 40 199
## 45 160
## 50 172
## 55 132
## 60 97
That seems a good selection process (+/- 1 year and the whole WHO normal BMI category), so let's check the proportion of women in menopause in each of these categories (keeping both smokers and non-smokers, as the effect of smoking is rather small, see later):
tab <- with(HSE98women,
rbind(
mean(menopause[age >= 39 & age <= 41 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
mean(menopause[age >= 44 & age <= 46 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
mean(menopause[age >= 49 & age <= 51 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
mean(menopause[age >= 54 & age <= 56 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
mean(menopause[age >= 59 & age <= 61 & bmi > 18.5 & bmi < 25], na.rm = TRUE)
)
)
rownames(tab) <- c(40, 45, 50, 55, 60)
round(tab, 2)
## [,1]
## 40 0.08
## 45 0.12
## 50 0.52
## 55 0.86
## 60 0.98
These results are roughly consistent with the predictions from the GAM. They confirm that the logit and cauchit GLMs underestimate the prevalence of menopause at old age. All models, however, may overestimate the prevalence of menopause at young age. Depending on the goal of the study, other link functions, not implemented in base R, could be tried, but this is quite advanced so we won't do it.
We can now turn to the exploration of the smoking effect by comparing predictions between smokers and non-smokers at the median BMI:
data.for.pred.nosmoke <- expand.grid(
age = age.for.pred,
bmi = median(mod_menop_cauchit_gam$model$bmi),
smoked = FALSE
)
gam.p.nosmoke <- predict(mod_menop_logit_gam, newdata = data.for.pred.nosmoke, se.fit = TRUE)
gam.upr.nosmoke <- binomial(link = "logit")$linkinv(gam.p.nosmoke$fit + qnorm(0.975) * gam.p.nosmoke$se.fit)
gam.lwr.nosmoke <- binomial(link = "logit")$linkinv(gam.p.nosmoke$fit + qnorm(0.025) * gam.p.nosmoke$se.fit)
gam.fit.nosmoke <- binomial(link = "logit")$linkinv(gam.p.nosmoke$fit)
plot(gam.fit ~ age.for.pred, type = "l", las = 1, ylab = "Probability of being menopaused", xlab = "Age (yrs)", ylim = c(0, 1), col = "red")
points(gam.upr ~ age.for.pred, type = "l", lty = 2, col = "red")
points(gam.lwr ~ age.for.pred, type = "l", lty = 2, col = "red")
points(gam.fit.nosmoke ~ age.for.pred, type = "l", col = "green")
points(gam.upr.nosmoke ~ age.for.pred, type = "l", lty = 2, col = "green")
points(gam.lwr.nosmoke ~ age.for.pred, type = "l", lty = 2, col = "green")
legend("topleft", fill = c("green", "red"), legend = c("non-smoker", "smoker"), bty = "n")
Although significant, the effect of smoking is rather small. Because the interaction between the smoking status and the BMI was not significant, we won't draw predictions to explore that.
## body brain
## Arctic fox 3.385 44.5
## Owl monkey 0.480 15.5
## Mountain beaver 1.350 8.1
## Cow 465.000 423.0
## Grey wolf 36.330 119.5
## Goat 27.660 115.0
## 'data.frame': 62 obs. of 2 variables:
## $ body : num 3.38 0.48 1.35 465 36.33 ...
## $ brain: num 44.5 15.5 8.1 423 119.5 ...
plot(log(brain, 10) ~ log(body, 10), data = mammals, pch = ".")
text(log(mammals$body, 10), log(mammals$brain, 10),
labels = rownames(mammals), cex = 0.8)
## Influence measures of
## lm(formula = log10_brain ~ log10_body, data = mammals) :
##
## dfb.1_ dfb.l10_ dffit cov.r cook.d hat inf
## Arctic fox 0.12929 -0.005286 0.13865 1.011 9.58e-03 0.0162
## Owl monkey 0.26067 -0.147324 0.26503 0.961 3.40e-02 0.0233
## Mountain beaver -0.05218 0.016630 -0.05237 1.048 1.39e-03 0.0179
## Cow -0.04138 -0.211577 -0.25174 1.055 3.16e-02 0.0549
## Grey wolf -0.00609 -0.007042 -0.01197 1.060 7.28e-05 0.0247
## Goat 0.01413 0.013610 0.02525 1.057 3.24e-04 0.0227
## Roe deer 0.05871 0.034610 0.08615 1.041 3.75e-03 0.0192
## Guinea pig -0.09246 0.035742 -0.09247 1.039 4.32e-03 0.0190
## Verbet 0.14358 0.004869 0.15857 0.999 1.25e-02 0.0161
## Chinchilla 0.08146 -0.048110 0.08326 1.050 3.51e-03 0.0242
## Ground squirrel 0.25867 -0.219237 0.28821 1.003 4.08e-02 0.0383
## Arctic ground squirrel -0.06762 0.028197 -0.06763 1.047 2.32e-03 0.0195
## African giant pouched rat -0.04995 0.019801 -0.04995 1.050 1.27e-03 0.0191
## Lesser short-tailed shrew -0.04166 0.050503 -0.05574 1.135 1.58e-03 0.0901 *
## Star-nosed mole -0.00550 0.005087 -0.00635 1.083 2.05e-05 0.0451
## Nine-banded armadillo -0.12039 0.003547 -0.12965 1.016 8.40e-03 0.0161
## Tree hyrax -0.02700 0.005614 -0.02756 1.050 3.86e-04 0.0168
## N.A. opossum -0.13177 0.033606 -0.13332 1.017 8.89e-03 0.0172
## Asian elephant 0.00691 0.169305 0.18752 1.119 1.78e-02 0.0873 *
## Big brown bat -0.15335 0.160920 -0.18818 1.080 1.78e-02 0.0600
## Donkey -0.00234 -0.007012 -0.00896 1.079 4.08e-05 0.0416
## Horse -0.01962 -0.107873 -0.12749 1.086 8.23e-03 0.0568
## European hedgehog -0.14647 0.066672 -0.14677 1.019 1.08e-02 0.0203
## Patas monkey 0.13081 0.051285 0.17243 0.996 1.47e-02 0.0177
## Cat 0.03635 -0.001799 0.03885 1.048 7.66e-04 0.0162
## Galago 0.16730 -0.122879 0.17828 1.032 1.59e-02 0.0307
## Genet 0.09091 -0.027906 0.09134 1.037 4.21e-03 0.0178
## Giraffe -0.01805 -0.100220 -0.11834 1.088 7.09e-03 0.0570
## Gorilla -0.01057 -0.033490 -0.04238 1.079 9.13e-04 0.0430
## Grey seal 0.03027 0.058263 0.08230 1.061 3.43e-03 0.0323
## Rock hyrax-a 0.12403 -0.057785 0.12438 1.030 7.77e-03 0.0206
## Human 0.21965 0.352501 0.52677 0.797 1.22e-01 0.0292 *
## African elephant 0.00077 -0.049868 -0.05399 1.161 1.48e-03 0.1098 *
## Water opossum -0.31028 0.009141 -0.33415 0.845 5.09e-02 0.0161 *
## Rhesus monkey 0.26321 0.058332 0.31729 0.868 4.65e-02 0.0167 *
## Kangaroo -0.09323 -0.105229 -0.18077 1.015 1.63e-02 0.0244
## Yellow-bellied marmot -0.05936 -0.001288 -0.06523 1.042 2.15e-03 0.0161
## Golden hamster -0.13927 0.114256 -0.15340 1.051 1.18e-02 0.0362
## Mouse -0.06543 0.068659 -0.08029 1.096 3.27e-03 0.0600
## Little brown bat -0.01958 0.022374 -0.02523 1.118 3.24e-04 0.0755 *
## Slow loris 0.02667 -0.008240 0.02680 1.052 3.65e-04 0.0178
## Okapi -0.00660 -0.023306 -0.02900 1.083 4.27e-04 0.0456
## Rabbit -0.05933 0.008300 -0.06160 1.043 1.92e-03 0.0164
## Sheep 0.00117 0.001766 0.00270 1.064 3.70e-06 0.0282
## Jaguar -0.05098 -0.107563 -0.14822 1.048 1.11e-02 0.0341
## Chimpanzee 0.10845 0.157062 0.24339 0.992 2.91e-02 0.0276
## Baboon 0.19253 0.080361 0.25728 0.934 3.17e-02 0.0179
## Desert hedgehog -0.17726 0.094998 -0.17929 1.009 1.60e-02 0.0224
## Giant armadillo -0.08742 -0.137621 -0.20701 1.015 2.13e-02 0.0289
## Rock hyrax-b -0.00906 0.000179 -0.00980 1.051 4.88e-05 0.0161
## Raccoon 0.07343 0.003105 0.08138 1.037 3.34e-03 0.0162
## Rat -0.12585 0.084698 -0.13144 1.042 8.69e-03 0.0276
## E. American mole -0.00145 0.001296 -0.00165 1.080 1.39e-06 0.0421
## Mole rat 0.14010 -0.114569 0.15415 1.050 1.20e-02 0.0360
## Musk shrew -0.27530 0.263044 -0.32233 1.015 5.11e-02 0.0483
## Pig -0.07152 -0.217294 -0.27696 1.018 3.79e-02 0.0420
## Echidna 0.04523 -0.003677 0.04782 1.046 1.16e-03 0.0162
## Brazilian tapir -0.06894 -0.189122 -0.24564 1.025 2.99e-02 0.0396
## Tenrec -0.22939 0.096893 -0.22948 0.967 2.56e-02 0.0196
## Phalanger -0.01210 0.003252 -0.01222 1.052 7.59e-05 0.0174
## Tree shrew 0.12605 -0.106264 0.14018 1.057 9.91e-03 0.0379
## Red fox 0.11779 0.004449 0.13030 1.015 8.48e-03 0.0161
##
## Call:
## lm(formula = log10_brain ~ log10_body, data = mammals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74503 -0.21380 -0.02676 0.18934 0.84613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92713 0.04171 22.23 <2e-16 ***
## log10_body 0.75169 0.02846 26.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3015 on 60 degrees of freedom
## Multiple R-squared: 0.9208, Adjusted R-squared: 0.9195
## F-statistic: 697.4 on 1 and 60 DF, p-value: < 2.2e-16
mod_nohuman <- lm(log10_brain ~ log10_body, data = mammals[rownames(mammals)!="Human", ])
plot(mod_nohuman, which = 4)
t <- (mod$coefficients[2] - 2/3) / sqrt(diag(vcov(mod))[2])
2*pt(abs(t), df = mod$df.residual, lower.tail = FALSE)
## log10_body
## 0.00407627
## 2.5 % 97.5 %
## (Intercept) 0.8436923 1.0105616
## log10_body 0.6947503 0.8086215
## Analysis of Variance Table
##
## Model 1: log10_brain ~ log10_body
## Model 2: log10_brain ~ 1 + offset(2/3 * log10_body)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 60 5.4552
## 2 61 6.2663 -1 -0.81117 8.9219 0.004076 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t <- (mod_nohuman$coefficients[2] - 2/3) / sqrt(diag(vcov(mod_nohuman))[2])
2*pt(abs(t), df = mod_nohuman$df.residual, lower.tail = FALSE)
## log10_body
## 0.006644006
## 2.5 % 97.5 %
## (Intercept) 0.8400581 0.9970116
## log10_body 0.6885052 0.7960480
mod2_nohuman <- lm(log10_brain ~ 1 + offset(2/3*log10_body), data = mammals[rownames(mammals)!="Human", ])
anova(mod_nohuman, mod2_nohuman)
## Analysis of Variance Table
##
## Model 1: log10_brain ~ log10_body
## Model 2: log10_brain ~ 1 + offset(2/3 * log10_body)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 59 4.7177
## 2 60 5.3507 -1 -0.63303 7.9167 0.006644 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can plot the predictions from the models:
plot(log(brain, 10) ~ log(body, 10), data = mammals, pch = ".")
text(log(mammals$body, 10), log(mammals$brain, 10),
labels = rownames(mammals), cex = 0.8)
abline(mod, col = "blue", lwd = 2)
abline(mod_nohuman, col = "blue", lwd = 2, lty = 2)
abline(coef(mod2), 2/3, col = "red", lwd = 2)
abline(coef(mod2_nohuman), 2/3, col = "red", lwd = 2, lty = 2)
legend("topleft", lty = c(1, 1, 2, 2), lwd = c(2, 2, 2, 2), col = c("blue", "red", "blue", "red"),
       legend = c("slope estimated", "slope = 2/3", "slope estimated (no human)", "slope = 2/3 (no human)"))
## 1
## 8.289624
An animal of 1 kg is predicted to have a brain size of 8.29 g.
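As a reminder of how such a prediction is obtained, here is a minimal sketch: predict on the log10 scale and back-transform. Note that this assumes the unconstrained fit `mod` (which gives roughly 8.46 g, since `10^0.92713 ≈ 8.46`); the 8.29 g figure quoted above was presumably computed from a slightly different model (e.g. the constrained-slope fit), which is not shown in this output.

```r
library(MASS)  # provides the mammals dataset (body in kg, brain in g)

mammals$log10_body  <- log10(mammals$body)
mammals$log10_brain <- log10(mammals$brain)
mod <- lm(log10_brain ~ log10_body, data = mammals)

# Predicted log10(brain) for a 1 kg animal, back-transformed to grams.
# Since log10(1) = 0, this is simply 10^intercept.
10^predict(mod, newdata = data.frame(log10_body = log10(1)))
```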
## [,1]
## [1,] "Human"
## [2,] "Rhesus monkey"
## [3,] "Baboon"
## [4,] "Owl monkey"
## [5,] "Chimpanzee"
## [6,] "Ground squirrel"
## [7,] "Patas monkey"
## [8,] "Verbet"
## [9,] "Arctic fox"
## [10,] "Red fox"
## [11,] "Galago"
## [12,] "Rock hyrax-a"
## [13,] "Mole rat"
## [14,] "Tree shrew"
## [15,] "Genet"
## [16,] "Raccoon"
## [17,] "Roe deer"
## [18,] "Asian elephant"
## [19,] "Chinchilla"
## [20,] "Grey seal"
## [21,] "Echidna"
## [22,] "Cat"
## [23,] "Slow loris"
## [24,] "Goat"
## [25,] "Sheep"
## [26,] "E. American mole"
## [27,] "Star-nosed mole"
## [28,] "Donkey"
## [29,] "Grey wolf"
## [30,] "Rock hyrax-b"
## [31,] "Little brown bat"
## [32,] "Phalanger"
## [33,] "Okapi"
## [34,] "African elephant"
## [35,] "Lesser short-tailed shrew"
## [36,] "Gorilla"
## [37,] "Tree hyrax"
## [38,] "Mouse"
## [39,] "African giant pouched rat"
## [40,] "Mountain beaver"
## [41,] "Giraffe"
## [42,] "Rabbit"
## [43,] "Arctic ground squirrel"
## [44,] "Horse"
## [45,] "Yellow-bellied marmot"
## [46,] "Guinea pig"
## [47,] "Big brown bat"
## [48,] "Rat"
## [49,] "Jaguar"
## [50,] "Golden hamster"
## [51,] "N.A. opossum"
## [52,] "Nine-banded armadillo"
## [53,] "European hedgehog"
## [54,] "Cow"
## [55,] "Kangaroo"
## [56,] "Desert hedgehog"
## [57,] "Giant armadillo"
## [58,] "Brazilian tapir"
## [59,] "Pig"
## [60,] "Musk shrew"
## [61,] "Tenrec"
## [62,] "Water opossum"
## body brain
## Mountain beaver 1.35 8.1
## Cow 465.00 423.0
## Grey wolf 36.33 119.5
## Goat 27.66 115.0
## Guinea pig 1.04 5.5
## Dipliodocus 11700.00 50.0
## 'data.frame': 28 obs. of 2 variables:
## $ body : num 1.35 465 36.33 27.66 1.04 ...
## $ brain: num 8.1 423 119.5 115 5.5 ...
plot(log(brain, 10) ~ log(body, 10), data = Animals, pch = ".")
text(log(Animals$body, 10), log(Animals$brain, 10),
     labels = rownames(Animals), cex = 0.8)
## Influence measures of
## lm(formula = log10_brain ~ log10_body, data = Animals) :
##
## dfb.1_ dfb.l10_ dffit cov.r cook.d hat inf
## Mountain beaver -0.10905 0.07464 -0.10914 1.144 6.15e-03 0.0671
## Cow 0.01387 0.03645 0.06760 1.131 2.37e-03 0.0504
## Grey wolf 0.04125 -0.00271 0.05619 1.114 1.64e-03 0.0358
## Goat 0.05386 -0.00834 0.06889 1.111 2.46e-03 0.0362
## Guinea pig -0.16207 0.11505 -0.16207 1.135 1.35e-02 0.0720
## Dipliodocus 0.19057 -0.76224 -0.91399 0.782 3.47e-01 0.1173 *
## Asian elephant -0.02278 0.29807 0.40291 1.015 7.85e-02 0.0789
## Donkey 0.04730 0.04451 0.12131 1.098 7.55e-03 0.0413
## Horse 0.02354 0.07122 0.12783 1.114 8.40e-03 0.0518
## Potar monkey 0.13134 -0.05296 0.14365 1.086 1.05e-02 0.0413
## Cat 0.01461 -0.00850 0.01487 1.142 1.15e-04 0.0530
## Giraffe 0.02409 0.07430 0.13279 1.112 9.05e-03 0.0520
## Gorilla 0.04094 0.04320 0.11117 1.104 6.35e-03 0.0421
## Human 0.21767 0.03310 0.34600 0.882 5.52e-02 0.0360
## African elephant -0.06446 0.32550 0.40414 1.076 8.03e-02 0.1017
## Triceratops 0.14087 -0.60944 -0.73995 0.881 2.42e-01 0.1110
## Rhesus monkey 0.23101 -0.10936 0.24421 1.025 2.95e-02 0.0447
## Kangaroo -0.02731 0.00215 -0.03686 1.119 7.05e-04 0.0358
## Golden hamster -0.39051 0.33844 -0.39973 1.135 7.96e-02 0.1261
## Mouse -0.52481 0.49630 -0.55286 1.193 1.51e-01 0.1840
## Rabbit -0.08303 0.05118 -0.08382 1.136 3.64e-03 0.0569
## Sheep 0.05082 0.00515 0.07795 1.107 3.14e-03 0.0359
## Jaguar 0.01475 0.00615 0.02800 1.122 4.07e-04 0.0375
## Chimpanzee 0.13396 0.00995 0.20160 1.030 2.02e-02 0.0358
## Rat -0.29329 0.23887 -0.29631 1.133 4.43e-02 0.1020
## Brachiosaurus 0.40128 -1.07770 -1.19872 0.849 5.97e-01 0.1863 *
## Mole -0.10487 0.09079 -0.10732 1.229 5.97e-03 0.1256
## Pig 0.00159 0.00154 0.00413 1.128 8.87e-06 0.0415
##
## Call:
## lm(formula = log10_brain ~ log10_body, data = Animals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4284 -0.2937 0.1440 0.3755 1.1220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.10958 0.17942 6.184 1.53e-06 ***
## log10_body 0.49599 0.07817 6.345 1.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6652 on 26 degrees of freedom
## Multiple R-squared: 0.6076, Adjusted R-squared: 0.5925
## F-statistic: 40.26 on 1 and 26 DF, p-value: 1.017e-06
Animals_nodino <- Animals[!rownames(Animals) %in%
c("Triceratops", "Brachiosaurus", "Dipliodocus"), ]
mod_nodino <- lm(log10_brain ~ log10_body, data = Animals_nodino)
plot(mod_nodino)
## Influence measures of
## lm(formula = log10_brain ~ log10_body, data = Animals_nodino) :
##
## dfb.1_ dfb.l10_ dffit cov.r cook.d hat inf
## Mountain beaver -0.10881 0.071003 -0.10895 1.158 6.16e-03 0.0695
## Cow -0.01015 -0.212504 -0.30351 1.077 4.59e-02 0.0785
## Grey wolf -0.01178 -0.003483 -0.01990 1.139 2.07e-04 0.0413
## Goat 0.01800 0.002504 0.02737 1.137 3.91e-04 0.0403
## Guinea pig -0.19216 0.131714 -0.19216 1.135 1.89e-02 0.0754
## Asian elephant -0.03879 0.182189 0.21825 1.224 2.45e-02 0.1320
## Donkey -0.00337 -0.009563 -0.01677 1.161 1.47e-04 0.0593
## Horse -0.00243 -0.111533 -0.15646 1.160 1.26e-02 0.0813
## Potar monkey 0.22103 -0.057315 0.25724 0.999 3.24e-02 0.0421
## Cat 0.06210 -0.031964 0.06390 1.147 2.13e-03 0.0533
## Giraffe -0.00192 -0.104180 -0.14581 1.165 1.10e-02 0.0817
## Gorilla -0.00996 -0.032457 -0.05526 1.159 1.59e-03 0.0611
## Human 0.32062 0.228866 0.69981 0.526 1.74e-01 0.0448 *
## African elephant 0.02117 -0.072423 -0.08265 1.317 3.57e-03 0.1723 *
## Rhesus monkey 0.49177 -0.178325 0.53979 0.697 1.19e-01 0.0449 *
## Kangaroo -0.14080 -0.038372 -0.23433 1.018 2.71e-02 0.0411
## Golden hamster -0.32757 0.288238 -0.33865 1.204 5.82e-02 0.1452
## Mouse -0.17546 0.170672 -0.18836 1.392 1.84e-02 0.2235 *
## Rabbit -0.11782 0.066261 -0.11964 1.136 7.40e-03 0.0577
## Sheep -0.00100 -0.000613 -0.00206 1.143 2.23e-06 0.0439
## Jaguar -0.06117 -0.079749 -0.17930 1.089 1.63e-02 0.0499
## Chimpanzee 0.14753 0.082366 0.29419 0.968 4.16e-02 0.0434
## Rat -0.28176 0.230127 -0.28604 1.164 4.16e-02 0.1134
## Mole 0.31175 -0.273969 0.32217 1.209 5.28e-02 0.1445
## Pig -0.06495 -0.190697 -0.33188 0.999 5.34e-02 0.0597
##
## Call:
## lm(formula = log10_brain ~ log10_body, data = Animals_nodino)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39628 -0.20636 -0.06760 0.08427 0.83832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.93391 0.08712 10.72 2.03e-10 ***
## log10_body 0.75226 0.04572 16.45 3.24e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3152 on 23 degrees of freedom
## Multiple R-squared: 0.9217, Adjusted R-squared: 0.9183
## F-statistic: 270.7 on 1 and 23 DF, p-value: 3.243e-14
plot(log10_brain ~ log10_body, data = Animals, col = NULL)
text(x = Animals$log10_body, y = Animals$log10_brain,
labels = rownames(Animals), cex = 0.8,
col = ifelse( rownames(Animals) %in% c("Triceratops", "Brachiosaurus", "Dipliodocus"), "red", "black"))
p <- predict(mod, newdata = data.frame("log10_body" = range(log(Animals$body, 10))))
p_nodino <- predict(mod_nodino, newdata = data.frame("log10_body" = range(log(Animals$body, 10))))
points(x = range(Animals$log10_body), y = p, col = "red", type = "l", lty = 2)
points(x = range(Animals$log10_body), y = p_nodino, col = "blue", type = "l")